Boosting

"Try again. Fail again. Fail better." (Samuel Beckett)

  1. I Contain Multitudes
    1.1 Decision Trees
    1.2 A Song of Bias and Variance
    1.3 Bagging vs Boosting
  2. The Strength of Weak Learners
    2.1 A Short Trip to Classtown
    2.2 Does It Actually Work?
  3. The AdaBoost Algorithm
    3.1 Decision Stumps and Decision Boundaries
    3.2 Oops, I Did It Again: AdaBoost Intuition
    3.3 The Algorithm
  4. Gradient Boosting
    4.1 All I Want for Boosting is Gradient
    4.2 Gradient Boosting Intuition
    4.3 AdaBoost vs Gradient Boosting
  5. Resources

1. I Contain Multitudes

multitudes.png

1.1 Decision Trees

decision_tree.png

Define an impurity measure, e.g. the Gini impurity index: $G(N) = \sum_{k} p_k(1-p_k)$, where $p_k$ is the fraction of examples of class $k$ at node $N$.

gini_binary.png

Algorithm time complexity: O(?)
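The Gini index defined above is a one-liner in practice. A minimal sketch (the function name and toy labels are mine):

```python
import numpy as np

def gini(labels):
    """Gini impurity G(N) = sum_k p_k * (1 - p_k) for the labels at node N."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

# A pure node has zero impurity; a balanced binary node peaks at 0.5.
print(gini([1, 1, 1, 1]))    # 0.0
print(gini([1, 1, -1, -1]))  # 0.5
```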


1.2 A Song of Bias and Variance

bias_variance.png

  • $Y=f(\mathbf{X}) + \epsilon$: true relationship between the observations $Y$ and the covariates $\mathbf{X}$.
  • $\epsilon$: Gaussian noise with zero mean and standard deviation $\sigma_\epsilon$.
  • $\hat{f}(x)$: model. $$ Err(x)=E[(Y - \hat{f}(x))^2] $$

$$ Err(x)=(E[\hat{f}(x)] - f(x))^2 + E[(\hat{f}(x) - E[\hat{f}(x)])^2] + {\sigma_{\epsilon}}^2 $$

$$ Err(x)=\text{Bias}^2 + \text{Variance} + \text{Irreducible Error} $$
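The decomposition can be checked numerically. In the sketch below (my own setup: $f = \sin$, a deliberately crude constant-prediction model) we draw many training sets, estimate $Err(x_0)$, Bias$^2$, and Variance by Monte Carlo, and compare the two sides of the equation:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                  # true relationship (illustrative choice)
sigma = 0.3                 # noise standard deviation sigma_epsilon
x0 = 1.0                    # point where we evaluate the error
n, trials = 20, 20000

preds = np.empty(trials)    # f_hat(x0) over many training sets
errs = np.empty(trials)     # squared error on a fresh observation at x0
for k in range(trials):
    x = rng.uniform(0, 2, n)
    y = f(x) + rng.normal(0, sigma, n)
    fhat = y.mean()         # deliberately simple (high-bias) model
    preds[k] = fhat
    y0 = f(x0) + rng.normal(0, sigma)
    errs[k] = (y0 - fhat) ** 2

bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
# left side vs right side of the decomposition: the two should roughly agree
print(errs.mean(), bias2 + var + sigma**2)
```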

1.3 Bagging vs Boosting

bagging_boosting-3.png

2. The Strength of Weak Learners

Hypothesis Boosting Problem (Michael Kearns, 1988)

Does the existence of an efficient learning algorithm that outputs a hypothesis whose performance is only slightly better than random guessing imply the existence of an efficient algorithm that outputs a hypothesis of arbitrary accuracy?

weak_to_strong_learner.webp

2.1 A Short Trip to Classtown

  • Destination: $\{(x_1, 1),(x_2, 1),(x_3, -1)\}$
  • labels form a point in $\mathbb{R}^3$
  • movement hypothesis $h$: $(h(x_1), h(x_2), h(x_3))$
  • total movement: $H(x)=\sum_{t}\alpha_t h_t(x)$

function_space.png

2.2 Does It Actually Work?

boosting_intuition.gif

3.1 AdaBoost: Decision Stumps and Decision Boundaries

Yes, a weak learner can be boosted into a strong one (Schapire, 1990).

Decision stump: a decision tree with only one split.
Decision boundary: the surface separating classes in a classification problem.
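A decision stump can be fit by brute force over (feature, threshold, sign) triples, picking the one with the lowest weighted error. A minimal sketch (function names and toy data are mine):

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: find the (feature, threshold, sign)
    minimizing the weighted 0/1 error. Labels y are in {-1, +1}."""
    best = (np.inf, 0, 0.0, 1)          # (error, feature, threshold, sign)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, j, thr, sign)
    return best

def predict_stump(stump, X):
    _, j, thr, sign = stump
    return np.where(X[:, j] <= thr, sign, -sign)

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1, 1, -1, -1])
w = np.full(4, 0.25)                    # uniform example weights
stump = fit_stump(X, y, w)
print(predict_stump(stump, X))          # this toy set is separable by one split
```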


3.2 Oops, I Did It Again: AdaBoost Intuition

  1. Train a decision stump on the weighted training examples, so that examples with higher weights are prioritized.
  2. Update the weights of the training examples so that misclassified examples receive higher weights: the worse the error, the higher the weight.

adaboost_iteration0.png

adaboost_iteration1.png

adaboost_iteration2.png

adaboost_ensemble.png

3.3 AdaBoost: the Algorithm

  1. Train a weak learner $h_t(x)$ using the weighted training examples $(x_i, y_i, D_i)$

    • Compute the weighted training error of $h_t(x)$: $\epsilon_t=\sum_{i:y_i\neq h_t(x_i)}D_i$
    • Compute the weight of the weak learner, $\alpha_t$, which depends on $\epsilon_t$: $\alpha_t=\frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$
  2. Update the weights of the training examples

    • $D_i \leftarrow D_i\,e^{-\alpha_t y_i h_t(x_i)}$, then normalize so that $\sum_i D_i = 1$

How does the weight change for correctly classified examples? How for misclassified ones?

The overall classifier after $T$ iterations is just a weighted ensemble:

$$ H(x)= \sum_{t=1}^T \alpha_t h_t (x). $$
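The two steps above can be sketched end to end. This is a toy implementation with brute-force threshold stumps as the weak learners; the data set is mine (no single stump can classify it correctly, but a few boosted rounds can):

```python
import numpy as np

def adaboost(X, y, T):
    """AdaBoost with threshold stumps as weak learners. Labels y in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)              # uniform initial example weights
    ensemble = []                        # list of (alpha_t, feature, thr, sign)
    for _ in range(T):
        # 1. train the weak learner on the weighted examples
        best = None
        for j in range(X.shape[1]):
            for thr in X[:, j]:
                for sign in (+1, -1):
                    pred = np.where(X[:, j] <= thr, sign, -sign)
                    err = D[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign, pred)
        eps, j, thr, sign, pred = best
        eps = max(eps, 1e-10)            # guard against division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)
        ensemble.append((alpha, j, thr, sign))
        # 2. re-weight: misclassified examples get larger weights
        D *= np.exp(-alpha * y * pred)
        D /= D.sum()                     # normalize back to a distribution
    return ensemble

def predict(ensemble, X):
    H = sum(a * np.where(X[:, j] <= thr, s, -s) for a, j, thr, s in ensemble)
    return np.sign(H)

# No single stump fits [+1, -1, +1], but the weighted ensemble does
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1, -1, 1])
model = adaboost(X, y, T=5)
print(predict(model, X))                 # recovers the labeling [1, -1, 1]
```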


4.1 All I Want for Boosting is Gradient

We use the Branin function as a test function to visualize gradient descent. The Branin function is a function of two variables $w_1$ and $w_2$:

$$ f(w_1, w_2) = a (w_2 - b w_1^2 + c w_1 - r)^2 + s (1-t) \cos{w_1} + s $$

where $a = 1$, $b = 5.1/(4\pi^2)$, $c = 5/\pi$, $r = 6$, $s = 10$, and $t = 1/(8\pi)$ are fixed constants.
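A quick sketch of the Branin function and plain gradient descent on it; the step size and starting point are arbitrary choices of mine:

```python
import numpy as np

a, b, c = 1.0, 5.1 / (4 * np.pi**2), 5 / np.pi
r, s, t = 6.0, 10.0, 1 / (8 * np.pi)

def branin(w1, w2):
    return a * (w2 - b * w1**2 + c * w1 - r)**2 + s * (1 - t) * np.cos(w1) + s

def branin_grad(w1, w2):
    """Analytic gradient of the Branin function."""
    inner = w2 - b * w1**2 + c * w1 - r
    dw1 = 2 * a * inner * (-2 * b * w1 + c) - s * (1 - t) * np.sin(w1)
    dw2 = 2 * a * inner
    return np.array([dw1, dw2])

# plain gradient descent from an arbitrary starting point
w = np.array([6.0, 14.0])
for _ in range(200):
    w -= 0.01 * branin_grad(*w)
print(w, branin(*w))   # the objective ends far below its starting value
```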


4.2 Gradient Boosting Intuition

  • Loss function $L$
  • Model at step $t$: $H_t(x)=H_{t-1}(x) + \alpha_t h_t (x)$
  • Choose $h_t(x)$ to be closest to the direction of the negative gradient of $L$: $-\frac{\partial L}{\partial H}\big|_{H=H_{t-1}(x)}$, i.e. the model's error at step $t-1$ (for squared loss, the residual $y - H_{t-1}(x)$)
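For squared loss this recipe reduces to repeatedly fitting the residuals. A minimal sketch with regression stumps as the weak learners (function names, data, and hyperparameters are all illustrative):

```python
import numpy as np

def fit_stump(x, residual):
    """Least-squares regression stump on 1-D inputs: split at a threshold
    and predict the residual mean on each side."""
    best = None
    for thr in x:
        left, right = residual[x <= thr], residual[x > thr]
        lmean = left.mean() if left.size else 0.0
        rmean = right.mean() if right.size else 0.0
        pred = np.where(x <= thr, lmean, rmean)
        sse = np.sum((residual - pred) ** 2)
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    return best[1:]

def gradient_boost(x, y, T, lr):
    H = np.full_like(y, y.mean())     # H_0: constant prediction
    stumps = []
    for _ in range(T):
        residual = y - H              # -dL/dH for squared loss L = (y - H)^2 / 2
        thr, lmean, rmean = fit_stump(x, residual)
        h = np.where(x <= thr, lmean, rmean)
        H = H + lr * h                # H_t = H_{t-1} + alpha_t * h_t
        stumps.append((thr, lmean, rmean))
    return H, stumps

x = np.linspace(0, 2 * np.pi, 40)
y = np.sin(x)
H, _ = gradient_boost(x, y, T=200, lr=0.1)
print(np.mean((y - H) ** 2))          # training MSE shrinks as stumps accumulate
```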

boosting_vs_gradient_boosting.png

4.3 AdaBoost vs Gradient Boosting

adaboost_vs_gradient_boosting.png

Resources

  • Ensemble Methods for Machine Learning. Gautam Kunapuli
  • Ensemble Methods for Machine Learning, GitHub companion repo
  • CS4780 Cornell University, Kilian Weinberger
  • Probably Approximately Correct Learning
  • CS4780 on YouTube, Kilian Weinberger
  • The Hypothesis Boosting Problem. M. Kearns, 1988
  • Boosting: Foundations and Algorithms. R. Schapire and Y. Freund
  • Greedy Function Approximation: A Gradient Boosting Machine. J.H. Friedman
  • More on Grogu